In this work, we consider the GPU implementation of the steepest descent method with Fourier
acceleration for Landau gauge fixing, using CUDA. The performance of the code on a Tesla C2070
GPU is compared with a parallel CPU implementation.

On the lattice, Landau gauge is defined through the maximization, on each gauge orbit, of the functional

$$F_U[g] = \frac{1}{N_d N_c V} \sum_x \sum_\mu \mathrm{Re}\,\mathrm{Tr}\left[ g(x)\, U_\mu(x)\, g^\dagger(x+\hat\mu) \right],$$

where $N_d$ is the dimension of the space-time, $N_c$ is the dimension of the gauge group and $V$ the
lattice volume. The functional $F_U[g]$ can be maximised using a steepest descent
method [1,2]. However, when the method is applied to large lattice volumes, it faces the problem
of critical slowing down, which can be attenuated by Fourier acceleration.
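
For orientation, the Fourier accelerated steepest descent step is usually written as below. This is a sketch in the notation common in the literature, not a formula reproduced from the text above: the symbols $\Delta(x)$, $\hat F$, $\alpha$, $\theta$ and the lattice momentum $p$ are assumptions introduced here, and lattice units are used throughout.

```latex
% Lattice divergence of the gauge field at site x (up to normalization)
\Delta(x) = \sum_\nu \left[ U_\nu(x-\hat\nu) - U_\nu(x) \right]
            - \mathrm{h.c.} - \mathrm{trace}

% Measure of the distance to the gauge-fixed configuration,
% typically used as the stopping criterion
\theta = \frac{1}{N_c V} \sum_x \mathrm{Tr}\left[ \Delta(x)\, \Delta^\dagger(x) \right]

% Fourier accelerated update: the momentum-dependent factor rescales each
% Fourier mode so that all modes relax at a similar rate
g(x) = \exp\left[ \hat F^{-1}\, \frac{\alpha}{2}\,
       \frac{p^2_{\max}}{p^2}\, \hat F\, \Delta(x) \right]
```

In practice the exponential is often expanded to first order in $\alpha$ and the result is projected back onto the gauge group, so each iteration costs one forward and one inverse FFT per matrix component.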

The main goal of this work is to compare the performance of GPU and CPU
implementations of the Fourier accelerated Landau gauge fixing method. The MPI parallel version
of the algorithm was implemented in C++, using the machinery provided by the Chroma library [3];
for the Fourier transforms, the code uses PFFT, a parallel FFT library written by Michael Pippig
[4]. For the GPU code, we used version 4.1 of CUDA [5] (see also [6]); the FFTs are performed using
the CUFFT library by NVIDIA [7].

For this comparison, we use an NVIDIA Tesla C2070. The GPU code has been run using
a 12-real-number representation; furthermore, we used texture memory and switched ECC off.
The CPU code has been run on the Centaurus cluster, in Coimbra. In Centaurus, each node has
two quad-core Intel Xeon E5620 processors at 2.4 GHz, with 24 GB of RAM, and is connected
through a DDR InfiniBand network.
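
The 12-real-number representation can be illustrated with a short sketch. It assumes the gauge group is SU(3), so that only the first two rows of each link matrix (6 complex numbers, i.e. 12 reals) need to be stored and the third row is rebuilt on the fly as the complex conjugate of their cross product. The type and function names below are hypothetical and are not taken from the code discussed here.

```cpp
#include <array>
#include <cassert>
#include <complex>

using cplx = std::complex<double>;
using Row = std::array<cplx, 3>;

// Hypothetical helper: rebuild the third row of an SU(3) matrix from its
// first two rows a and b, using c = conj(a x b). This identity follows from
// unitarity together with det = 1.
Row reconstruct_third_row(const Row& a, const Row& b) {
    Row c = {
        std::conj(a[1] * b[2] - a[2] * b[1]),
        std::conj(a[2] * b[0] - a[0] * b[2]),
        std::conj(a[0] * b[1] - a[1] * b[0])
    };
    return c;
}

// Self-check on the diagonal SU(3) matrix diag(i, i, -1), whose determinant is 1:
// the reconstructed third row must be (0, 0, -1).
void self_check() {
    const Row a = {cplx(0.0, 1.0), cplx(0.0, 0.0), cplx(0.0, 0.0)};
    const Row b = {cplx(0.0, 0.0), cplx(0.0, 1.0), cplx(0.0, 0.0)};
    const Row c = reconstruct_third_row(a, b);
    assert(std::abs(c[0]) < 1e-14);
    assert(std::abs(c[1]) < 1e-14);
    assert(std::abs(c[2] - cplx(-1.0, 0.0)) < 1e-14);
}
```

Storing 12 instead of 18 reals per link trades a few extra floating-point operations for less memory traffic, which is usually a good exchange on bandwidth-bound GPU kernels.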

In order to compare the performance of the two codes, we used a $32^4$ lattice volume. The
configurations have been generated using the Wilson gauge action, with three different values of
$\beta$. The runs used the step parameter $\alpha = 0.08$ and the stopping criterion $\theta < 10^{-15}$.
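
To show where the step parameter 0.08 quoted above enters, the following sketch computes the momentum-space factor that multiplies each Fourier mode in a Fourier accelerated step. It assumes a four-dimensional lattice with even extents and the usual lattice momentum $p^2 = 4 \sum_\mu \sin^2(\pi n_\mu / N_\mu)$ in lattice units. The function names are hypothetical, and the treatment of the zero mode is an assumption made only to keep the example well defined.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Site = std::array<int, 4>;

// Lattice momentum squared in lattice units for the mode with integer
// components n on a lattice with extents N:
// p^2 = 4 * sum_mu sin^2(pi * n_mu / N_mu).
double lattice_p2(const Site& n, const Site& N) {
    const double pi = std::acos(-1.0);
    double p2 = 0.0;
    for (int mu = 0; mu < 4; ++mu) {
        assert(N[mu] > 0);
        const double s = std::sin(pi * n[mu] / N[mu]);
        p2 += 4.0 * s * s;
    }
    return p2;
}

// Hypothetical helper: factor (alpha / 2) * p_max^2 / p^2 applied to each mode.
// For even extents, p_max^2 = 4 * N_d = 16 in four dimensions.
double acceleration_factor(const Site& n, const Site& N, double alpha) {
    const double p2_max = 16.0;
    const double p2 = lattice_p2(n, N);
    // ASSUMPTION: the zero mode (p^2 = 0) is given the same weight as the
    // highest mode to avoid a division by zero; real codes may differ.
    if (p2 == 0.0) return 0.5 * alpha;
    return 0.5 * alpha * p2_max / p2;
}
```

With these conventions the highest mode on a $32^4$ lattice, $n = (16, 16, 16, 16)$, is multiplied by $\alpha/2$, while low-momentum modes receive a much larger weight, which is what counteracts critical slowing down.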
Figure 1: Strong scaling of the CPU code for a $32^4$ lattice volume, compared with the best
GPU performance (12-real-number parametrization, ECC off, texture memory, double
precision) [8]. In Centaurus, a cluster node means 8 computing cores.
In Figure 1 we compare the performance of the parallel CPU code against the GPU result. The
CPU code shows good strong scaling behaviour, with a linear speed-up in the number of
computing nodes. However, the GPU code was much faster: in order to reproduce the performance
of the GPU code, one needs 256 CPU cores (32 Centaurus nodes).

For more details on this work, please see [8, 9, 10]. 